Large language models are increasingly capable of generating fluent-appearing text with relatively little task-specific supervision. But can these models accurately explain classification decisions? We consider the task of generating free-text explanations using a small number of human-written examples (i.e., in a few-shot manner). We find that (1) authoring higher-quality examples for prompting leads to higher-quality generations; and (2) surprisingly, in a head-to-head comparison, crowdworkers often prefer explanations generated by GPT-3 to crowdsourced human-written explanations contained in existing datasets. However, crowdworker ratings also show that while the model produces factual, grammatical, and sufficient explanations, it still has room to improve along axes such as providing novel information and supporting the label. We create a pipeline that combines GPT-3 with a supervised filter, incorporating humans in the loop via binary acceptability judgments. Despite the significant subjectivity intrinsic to judging acceptability, our approach is able to consistently filter GPT-3-generated explanations deemed acceptable by humans.
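To make the setup concrete, the sketch below assembles a few-shot prompt from human-written example explanations, asks a language model for candidate explanations, and keeps only those a trained binary acceptability filter accepts. The prompt format, the `generate` callable, and the `is_acceptable` filter interface are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of a few-shot explanation pipeline with an acceptability filter.
# The model call and the filter are placeholders, not the authors' implementation.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    premise: str
    label: str
    explanation: str


def build_prompt(examples: List[Example], query: str, label: str) -> str:
    """Assemble a few-shot prompt from human-written example explanations."""
    blocks = [
        f"Input: {ex.premise}\nLabel: {ex.label}\nExplanation: {ex.explanation}"
        for ex in examples
    ]
    blocks.append(f"Input: {query}\nLabel: {label}\nExplanation:")
    return "\n\n".join(blocks)


def generate_filtered_explanations(
    examples: List[Example],
    query: str,
    label: str,
    generate: Callable[[str], List[str]],            # e.g. wraps an LLM API, returns candidates
    is_acceptable: Callable[[str, str, str], bool],  # supervised binary acceptability filter
) -> List[str]:
    prompt = build_prompt(examples, query, label)
    candidates = generate(prompt)
    # Keep only generations the trained filter judges acceptable for (input, label).
    return [c for c in candidates if is_acceptable(query, label, c)]
```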
Explainable NLP (ExNLP) has increasingly focused on collecting human-annotated textual explanations. These explanations are used downstream in three ways: as data augmentation to improve performance on a predictive task, as supervision to train models to produce explanations for their predictions, and as ground truth against which to evaluate model-generated explanations. In this review, we identify 65 datasets with three predominant classes of textual explanations (highlights, free-text, and structured), organize the literature on annotating each type, identify strengths and shortcomings of existing collection methodologies, and give recommendations for collecting ExNLP datasets in the future.
In interpretable NLP, we require faithful rationales that reflect the model's decision-making process on an explained instance. While prior work has focused on extractive rationales (a subset of the input tokens), we investigate their less-studied counterpart: free-text natural language rationales. We demonstrate that existing pipeline models for faithful extractive rationalization on information-extraction-style tasks do not extend reliably to "reasoning" tasks requiring free-text rationales. We turn to models that jointly predict and rationalize, a widely used class of high-performance models for free-text rationalization whose faithfulness has not yet been established. We define label-rationale association as a necessary property of faithfulness: the internal mechanisms of the model that produce the label and the rationale must be meaningfully correlated. We propose two measurements to test this property: robustness equivalence and feature importance agreement. We find that state-of-the-art T5-based joint models exhibit both properties when rationalizing commonsense question answering and natural language inference, indicating their potential for producing faithful free-text rationales.
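A rough sketch of the robustness-equivalence intuition, assuming a joint model exposed through a `predict` function and some input perturbation `perturb`: apply the same perturbation to the input and compare how often the predicted label and the generated rationale change. This simplified change-rate comparison is an illustration, not the paper's exact measurement protocol.

```python
# Illustrative sketch of a robustness-equivalence style check: apply the same input
# perturbation to a joint predict-and-rationalize model and compare how often the
# predicted label and the generated rationale change. Function names are assumptions.
from typing import Callable, List, Tuple


def robustness_profile(
    inputs: List[str],
    predict: Callable[[str], Tuple[str, str]],   # returns (label, rationale) for an input
    perturb: Callable[[str, float], str],        # e.g. noise or paraphrase at a given strength
    noise_levels: List[float],
) -> List[dict]:
    """For each noise level, measure label-flip rate and rationale-change rate."""
    profile = []
    for sigma in noise_levels:
        label_flips, rationale_changes = 0, 0
        for x in inputs:
            label, rationale = predict(x)
            noisy_label, noisy_rationale = predict(perturb(x, sigma))
            label_flips += int(noisy_label != label)
            rationale_changes += int(noisy_rationale != rationale)
        n = len(inputs)
        profile.append({
            "noise": sigma,
            "label_flip_rate": label_flips / n,
            "rationale_change_rate": rationale_changes / n,
        })
    # Label-rationale association suggests the two rates should degrade together.
    return profile
```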
Attention mechanisms play a central role in NLP systems, especially within recurrent neural network (RNN) models. Recently, there has been increasing interest in whether or not the intermediate representations offered by these modules may be used to explain the reasoning for a model's prediction, and consequently reach insights regarding the model's decision-making process. A recent paper claims that 'Attention is not Explanation' (Jain and Wallace, 2019). We challenge many of the assumptions underlying this work, arguing that such a claim depends on one's definition of explanation, and that testing it needs to take into account all elements of the model. We propose four alternative tests to determine when/whether attention can be used as explanation: a simple uniform-weights baseline; a variance calibration based on multiple random seed runs; a diagnostic framework using frozen weights from pretrained models; and an end-to-end adversarial attention training protocol. Each allows for meaningful interpretation of attention mechanisms in RNN models. We show that even when reliable adversarial distributions can be found, they don't perform well on the simple diagnostic, indicating that prior work does not disprove the usefulness of attention mechanisms for explainability.
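The simplest of the four tests, the uniform-weights baseline, can be sketched as follows: keep a trained model's encoder and output head fixed, swap the learned attention distribution for uniform weights, and measure how much the output distribution changes. The `classify` head and the total-variation gap below are illustrative choices, not the paper's exact protocol.

```python
# Sketch of the uniform-weights baseline test: replace a model's learned attention
# distribution with uniform weights and see whether predictions change much.
# `hidden_states` / `classify` stand in for a trained RNN encoder and output head.
import numpy as np


def attended_prediction(hidden_states: np.ndarray,
                        attention: np.ndarray,
                        classify) -> np.ndarray:
    """hidden_states: (seq_len, dim); attention: (seq_len,) summing to 1."""
    context = attention @ hidden_states          # weighted sum over time steps
    return classify(context)                     # e.g. a frozen linear + softmax head


def uniform_baseline_gap(hidden_states, learned_attention, classify) -> float:
    seq_len = hidden_states.shape[0]
    uniform = np.full(seq_len, 1.0 / seq_len)
    p_learned = attended_prediction(hidden_states, learned_attention, classify)
    p_uniform = attended_prediction(hidden_states, uniform, classify)
    # Total variation distance between output distributions; a small gap means the
    # learned attention adds little over the trivial uniform alternative.
    return 0.5 * float(np.abs(p_learned - p_uniform).sum())
```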
Remote sensing imagery provides comprehensive views of the Earth, where different sensors collect complementary data at different spatial scales. Large, pretrained models are commonly finetuned with imagery that is heavily augmented to mimic different conditions and scales, with the resulting models used for various tasks with imagery from a range of spatial scales. Such models overlook scale-specific information in the data. In this paper, we present Scale-MAE, a pretraining method that explicitly learns relationships between data at different, known scales throughout the pretraining process. Scale-MAE pretrains a network by masking an input image at a known input scale, where the area of the Earth covered by the image determines the scale of the ViT positional encoding, not the image resolution. Scale-MAE encodes the masked image with a standard ViT backbone, and then decodes the masked image through a bandpass filter to reconstruct low/high frequency images at lower/higher scales. We find that tasking the network with reconstructing both low/high frequency images leads to robust multiscale representations for remote sensing imagery. Scale-MAE achieves an average of a $5.0\%$ non-parametric kNN classification improvement across eight remote sensing datasets compared to current state-of-the-art and obtains a $0.9$ mIoU to $3.8$ mIoU improvement on the SpaceNet building segmentation transfer task for a range of evaluation scales.
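The central design choice, tying the positional encoding to the ground area an image covers rather than its pixel resolution, can be illustrated with a rough 1-D sketch. The reference GSD, the scaling rule, and the sinusoidal form below are assumptions for illustration; the paper's actual GSD positional encoding may differ in detail.

```python
# Rough illustration of a scale-aware positional encoding: the standard sinusoidal
# ViT encoding is rescaled by the ground sample distance (GSD) of the image rather
# than its pixel count, so the same patch index maps to a larger position when the
# image covers more ground. Illustrative only, not the Scale-MAE implementation.
import numpy as np


def gsd_positional_encoding(num_patches: int, dim: int,
                            gsd: float, reference_gsd: float = 1.0) -> np.ndarray:
    """1-D sin/cos encoding whose positions are scaled by relative ground coverage.

    dim is assumed to be even.
    """
    positions = np.arange(num_patches, dtype=np.float64)
    positions = positions * (gsd / reference_gsd)    # same patch index, larger footprint
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))
    angles = positions[:, None] * freqs[None, :]
    enc = np.zeros((num_patches, dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc
```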
Anomaly detection on time series data is increasingly common across various industrial domains that monitor metrics in order to prevent potential accidents and economic losses. However, a scarcity of labeled data and ambiguous definitions of anomalies can complicate these efforts. Recent unsupervised machine learning methods have made remarkable progress in tackling this problem using either single-timestamp predictions or time series reconstructions. While traditionally considered separately, these methods are not mutually exclusive and can offer complementary perspectives on anomaly detection. This paper first highlights the successes and limitations of prediction-based and reconstruction-based methods with visualized time series signals and anomaly scores. We then propose AER (Auto-encoder with Regression), a joint model that combines a vanilla auto-encoder and an LSTM regressor to incorporate the successes and address the limitations of each method. Our model can produce bi-directional predictions while simultaneously reconstructing the original time series by optimizing a joint objective function. Furthermore, we propose several ways of combining the prediction and reconstruction errors through a series of ablation studies. Finally, we compare the performance of the AER architecture against two prediction-based methods and three reconstruction-based methods on 12 well-known univariate time series datasets from NASA, Yahoo, Numenta, and UCR. The results show that AER has the highest averaged F1 score across all datasets (a 23.5% improvement compared to ARIMA) while retaining a runtime similar to its vanilla auto-encoder and regressor components. Our model is available in Orion, an open-source benchmarking tool for time series anomaly detection.
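A minimal PyTorch sketch of the joint idea: a shared encoder feeds both a decoder that reconstructs the input window and a regression head that predicts values just outside the window, trained with a combined objective. Layer sizes, the `alpha` loss weighting, and the exact prediction targets are assumptions, not the published AER architecture.

```python
# Minimal sketch in the spirit of AER: one latent summary drives both reconstruction
# of the input window and regression of the values bordering it. Shapes and weighting
# are illustrative assumptions.
import torch
import torch.nn as nn


class JointAutoencoderRegressor(nn.Module):
    def __init__(self, window: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, window)   # reconstruct the whole window
        self.regressor = nn.Linear(hidden, 2)      # predict values before/after the window

    def forward(self, x):                          # x: (batch, window, 1)
        _, (h, _) = self.encoder(x)
        z = h[-1]                                  # (batch, hidden) latent summary
        return self.decoder(z), self.regressor(z)


def joint_loss(model, x, y_before, y_after, alpha: float = 0.5):
    recon, preds = model(x)
    recon_loss = nn.functional.mse_loss(recon, x.squeeze(-1))
    pred_loss = nn.functional.mse_loss(preds, torch.stack([y_before, y_after], dim=1))
    # At inference time, anomaly scores can combine reconstruction and prediction errors.
    return alpha * recon_loss + (1 - alpha) * pred_loss
```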
We present an update on the current architecture of the Zoea knowledge-based, Composable Inductive Programming system. The Zoea compiler is built using a modern variant of the black-board architecture. Zoea integrates a large number of knowledge sources that encode different aspects of programming language and software development expertise. We describe the use of synthetic test cases as a ubiquitous form of knowledge and hypothesis representation that supports a variety of reasoning strategies. Some future plans are also outlined.
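As a toy illustration of the blackboard idea (not Zoea's actual implementation), the sketch below has knowledge sources post candidate program hypotheses onto a shared blackboard, with synthetic test cases serving as the acceptance criterion. All names, including the trivial arithmetic knowledge source, are hypothetical.

```python
# Toy sketch of a blackboard-style loop: knowledge sources propose candidate program
# hypotheses onto a shared blackboard, and each hypothesis is kept only if it passes
# the synthetic test cases. This illustrates the architectural idea, not Zoea itself.
from typing import Callable, List, Tuple

TestCase = Tuple[tuple, object]          # (input args, expected output)
Hypothesis = Callable[..., object]       # a candidate program


def passes(hypothesis: Hypothesis, cases: List[TestCase]) -> bool:
    try:
        return all(hypothesis(*args) == expected for args, expected in cases)
    except Exception:
        return False


def blackboard_search(cases: List[TestCase],
                      knowledge_sources: List[Callable[[List[TestCase]], List[Hypothesis]]],
                      max_rounds: int = 3) -> List[Hypothesis]:
    blackboard: List[Hypothesis] = []
    for _ in range(max_rounds):
        for source in knowledge_sources:
            # Each knowledge source inspects the test cases (and could inspect the
            # blackboard) and contributes new candidate hypotheses.
            blackboard.extend(source(cases))
        solutions = [h for h in blackboard if passes(h, cases)]
        if solutions:
            return solutions
    return []


# Example: a trivial knowledge source proposing arithmetic one-liners.
arithmetic_source = lambda cases: [lambda x: x + 1, lambda x: x * 2]
solutions = blackboard_search([((2,), 4), ((3,), 6)], [arithmetic_source])
print(solutions[0](5))   # the surviving hypothesis doubles its input -> 10
```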
Quantum machine learning (QML) has received increasing attention due to its potential to outperform classical machine learning methods in various problems. A subclass of QML methods is quantum generative adversarial networks (QGANs), which have been studied as a quantum counterpart of classical GANs widely used in image manipulation and generation tasks. The existing work on QGANs is still limited to small-scale proof-of-concept examples based on images with significant down-scaling. Here we integrate classical and quantum techniques to propose a new hybrid quantum-classical GAN framework. We demonstrate its superior learning capabilities by generating $28 \times 28$ pixel grey-scale images without dimensionality reduction or classical pre/post-processing on multiple classes of the standard MNIST and Fashion MNIST datasets, achieving results comparable to classical frameworks with three orders of magnitude fewer trainable generator parameters. To gain further insight into the working of our hybrid approach, we systematically explore the impact of its parameter space by varying the number of qubits, the size of image patches, the number of layers in the generator, the shape of the patches and the choice of prior distribution. Our results show that increasing the quantum generator size generally improves the learning capability of the network. The developed framework provides a foundation for future design of QGANs with optimal parameter set tailored for complex image generation tasks.
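A heavily simplified PennyLane sketch of the patch-based idea: each sub-generator is a small parameterized circuit whose measurement probabilities supply the pixel intensities of one row of a 28x28 image, and the rows are stitched together. The circuit layout, the truncation of 32 measurement outcomes to 28 pixels, and the crude rescaling are illustrative simplifications; the actual framework's circuits, patch shapes, and adversarial training follow the paper, not this sketch.

```python
# Simplified patch-based quantum generator sketch (illustrative, untrained).
import numpy as np
import pennylane as qml

N_QUBITS = 5                                   # 2**5 = 32 measurement outcomes
PATCH_SIZE = 28                                # keep the first 28 as one image row
N_PATCHES = 28                                 # 28 rows -> a 28x28 image
N_LAYERS = 3

dev = qml.device("default.qubit", wires=N_QUBITS)


@qml.qnode(dev)
def patch_circuit(noise, weights):
    """One sub-generator: encode noise, then layers of rotations plus entanglement."""
    for q in range(N_QUBITS):
        qml.RY(noise[q], wires=q)
    for layer in range(N_LAYERS):
        for q in range(N_QUBITS):
            qml.RY(weights[layer, q], wires=q)
        for q in range(N_QUBITS - 1):
            qml.CNOT(wires=[q, q + 1])
    return qml.probs(wires=range(N_QUBITS))


def generate_image(noise, weights):
    """Stitch the patches produced by all sub-generators into a 28x28 image."""
    rows = []
    for p in range(N_PATCHES):
        probs = patch_circuit(noise, weights[p])
        patch = probs[:PATCH_SIZE]
        rows.append(patch / (patch.max() + 1e-9))   # crude rescaling to [0, 1]
    return np.stack(rows)                            # shape (28, 28)


# Example with random parameters; in a GAN these weights would be trained against
# a classical discriminator.
noise = np.random.uniform(0, np.pi, N_QUBITS)
weights = np.random.uniform(0, np.pi, (N_PATCHES, N_LAYERS, N_QUBITS))
print(generate_image(noise, weights).shape)
```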
Datasets for training recommender systems are often subject to distribution shift induced by users' and recommenders' selection biases. In this paper, we study the impact of selection bias on datasets with different quantization. We then leverage two differently quantized datasets from different source distributions to mitigate distribution shift by applying the inverse probability scoring method from causal inference. Empirically, our approach gains significant performance improvement over single-dataset methods and alternative ways of combining two datasets.
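The core correction can be sketched in a few lines of NumPy: each observed rating contributes to the training loss with a weight equal to the inverse of its estimated observation propensity, clipped to control variance. How the propensities are estimated, for example with the help of a second, differently quantized dataset, is assumed given here; the function below is an illustration of inverse probability scoring in general, not the paper's exact estimator.

```python
# Minimal sketch of an inverse-propensity-scored (IPS) loss for ratings observed
# under selection bias: each observed example is weighted by the inverse of its
# estimated probability of being observed.
import numpy as np


def ips_weighted_mse(predictions: np.ndarray,
                     ratings: np.ndarray,
                     propensities: np.ndarray,
                     clip: float = 0.05) -> float:
    """Selection-bias-corrected MSE estimate over the observed entries."""
    p = np.clip(propensities, clip, 1.0)          # clipping controls variance
    weights = 1.0 / p
    return float(np.mean(weights * (predictions - ratings) ** 2))


# Example: items shown more often (higher propensity) count less per observation.
preds = np.array([4.1, 3.0, 2.2])
obs = np.array([5.0, 3.0, 1.0])
props = np.array([0.9, 0.5, 0.1])
print(ips_weighted_mse(preds, obs, props))
```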
There has been great recent advancement in human-computer chat. However, proper evaluation currently requires human judgements that produce notoriously high-variance metrics due to their inherent subjectivity. Furthermore, there is little standardization in the methods and labels used for evaluation, with an overall lack of work to compare and assess the validity of various evaluation approaches. As a consequence, existing evaluation results likely leave an incomplete picture of the strengths and weaknesses of open-domain chatbots. We aim towards a dimensional evaluation of human-computer chat that can reliably measure several distinct aspects of chat quality. To this end, we present our novel human evaluation method that quantifies the rate of several quality-related chatbot behaviors. Our results demonstrate our method to be more suitable for dimensional chat evaluation than alternative Likert-style or comparative methods. We then use our validated method and existing methods to evaluate four open-domain chat models from the recent literature.
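A sketch of what a behavior-rate style, dimensional evaluation might compute, assuming annotators mark each bot turn for a set of quality-related behaviors; the behavior labels and data layout below are illustrative, not the paper's annotation scheme.

```python
# Sketch of a dimensional, behavior-rate evaluation: each dimension is summarized as
# the fraction of a bot's turns exhibiting that behavior. Labels are illustrative.
from collections import defaultdict
from typing import Dict, List


# Each annotated turn: {"bot": name, "labels": {behavior: bool, ...}}
def behavior_rates(turns: List[Dict]) -> Dict[str, Dict[str, float]]:
    counts = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for turn in turns:
        totals[turn["bot"]] += 1
        for behavior, present in turn["labels"].items():
            counts[turn["bot"]][behavior] += int(present)
    return {
        bot: {b: c / totals[bot] for b, c in behaviors.items()}
        for bot, behaviors in counts.items()
    }


turns = [
    {"bot": "A", "labels": {"contradiction": False, "irrelevant": True}},
    {"bot": "A", "labels": {"contradiction": True, "irrelevant": False}},
    {"bot": "B", "labels": {"contradiction": False, "irrelevant": False}},
]
print(behavior_rates(turns))   # per-bot rate for each behavior dimension
```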